Abstract

We present a new model for rollback recovery in distributed dataflow systems. We explain existing rollback schemes by assigning a logical time to each event such as a message delivery. If some processors fail during an execution, the system rolls back by selecting a set of logical times for each processor. The effect of events at times within the set is retained or restored from saved state, while the effect of other events is undone and re-executed. We show that, by adopting different logical time “domains” at different processors, an application can adopt appropriate checkpointing schemes for different parts of its computation. We illustrate with an example of an application that combines batch processing with low-latency streaming updates. We show rules, and an algorithm, to determine a globally consistent state for rollback in a system that uses multiple logical time domains. We also introduce selective rollback at a processor, which can selectively preserve the effect of events at some logical times and not others, independent of the original order of execution of those events. Selective rollback permits new checkpointing policies that are particularly well suited to iterative streaming algorithms. We report on an implementation of our new framework in the context of the Naiad system.


1 Introduction 

This paper is about fault tolerance in distributed dataflow systems. Specifically, we investigate the information that must be tracked and persisted in order to restart a system in a consistent state after the failure of one or more processes. We assume other requirements, such as detecting failures and reliably persisting state, are adequately covered by existing techniques. We describe a general mechanism and an implementation of it in the context of the Naiad [12] system. We also suggest how the ideas may be applied to other distributed systems. The mechanism is named after the Falkirk Wheel, a prior engineering solution for high-throughput streaming rollback.

Most fault-tolerant distributed systems adopt a fixed policy for checkpointing and logging. As a result, all applications running on these systems must operate with the same set of performance tradeoffs. Streaming applications often require high availability, i.e. the system must resume output soon after the detection of a failure. Systems designed for these applications must be able to restore quickly to a recent consistent state on failure, meaning they must frequently update persistent state. Other applications may be more sensitive to throughput or latency, which are hard to maintain while eagerly writing to stable storage. These conflicting application requirements are a major motivation for the development of multiple systems such as Spark, Storm, and Millwheel [6]. We argue that such systems would be more useful if they could mix policies, and thus performance tradeoffs, within a single application.

Figure 1: A complex streaming application. Different parts of the computation have different availability, throughput and latency requirements, and thus merit different fault-tolerance policies.

Consider the application in Figure 1. User queries arrive at the top left and are joined with two sets of data: first the output of a periodic batch computation, then the output of a continuously-updated iterative computation. Statistics about the query response are then stored in a database and the response is delivered back to the user. Concurrently the application receives a high-throughput stream of data records. Some fields of these records are directed to the batch computation, which is re-run periodically. Other fields are used as inputs to the iterative computation, which updates in real time.

The application adopts four separate fault-tolerance regimes for different regions of the computation, indicated by the different shading used for different parts of the dataflow illustration. The first we call “ephemeral,” which means that the records flowing through this part of the graph are never saved to stable storage, and none of the dataflow vertices they pass through store mutable state. Clients that introduce ephemeral records (users sending queries or the external service supplying high-throughput data) do not receive an acknowledgement until the records have flowed through the entire ephemeral subgraph, so fault tolerance for these records is attained by requiring clients to retry on failure. Data reductions are performed on the high-throughput input records before they leave the ephemeral regime. The second regime is “batch.” In this part of the graph there is a high-throughput data-intensive computation that is run periodically and can tolerate re-execution that introduces a large increase in latency (perhaps of minutes) in the case of a failure, since the results of the computation are never required to be fresh. The third regime is “lazy checkpoint.” This is used for the real-time analytics subgraph, which maintains complex state that must be regularly checkpointed. In the event of a failure it is acceptable to re-execute a few seconds’ worth of work in this regime, so checkpoints need not be taken every time state is updated. The final regime is “eager checkpoint.” This is used for the database updates, which must be persisted as soon as they are recorded, since they must be consistent with delivered results. There exist fault tolerance designs that fit several of these regimes, but no current system can include them all in a single application as we desire. The Falkirk Wheel framework makes this flexible mixture of policies possible.

In common with prior work [9] we propose to recover from a failure by restoring processes to previously-checkpointed states, optionally replaying logged events such as message deliveries that occurred after the checkpoints were taken, then restarting execution. Many standard checkpointing and logging techniques can be understood in terms of events tagged with partially-ordered logical times. After a failure the effect of events at logical times in a chosen set is restored from saved state, and events with times outside the set are re-executed. This paper makes two major contributions. First, we show how different subgraphs of a dataflow can make use of different logical time domains. This permits different styles of checkpointing, with different performance tradeoffs, to coexist within a single fault-tolerant application. We set down simple rules and a general algorithm for choosing a consistent global state after a failure, taking into account these different time domains. Second, we introduce the concept of selective rollback. This means that a process that has processed events at two different logical times $t_1$ and $t_2$ may be able to preserve the work for time $t_1$ after rollback but undo and re-execute the work for $t_2$, independent of the order in which the work was originally performed. We show that selective rollback allows new performance tradeoffs that are particularly well-suited to high-throughput, low-latency systems such as Naiad.

Our implementation targets the Naiad system, which previously had only basic support for fault tolerance. Naiad adopts a single underlying system mechanism and implements different computational models as libraries. Our design allows each library to adopt a checkpointing policy tailored to its performance characteristics, while still allowing the libraries to interact within a single application. Since Naiad supports sophisticated streaming algorithms that may include nested loops, it is a good testbed for general fault tolerance mechanisms. The ideas set out in this paper are applicable well beyond Naiad, and their implementation in a system without cyclic dataflow would be simpler. For example, we believe that some of the techniques we describe could be used, with modest effort, in the context of the Spark Streaming system.

The next section sketches a number of popular fault tolerance policies and explains selective rollback. Section 3 sets out the Falkirk Wheel design, and the section after it describes its implementation in the Naiad system. We finish with conclusions.

2 Tracking events for rollback 

In this section we summarize a few rollback recovery schemes and comment on the design and performance tradeoffs they embody. In our discussion we refer to a processing node in a dataflow graph as a processor. A physical CPU in a distributed system may host multiple such processors. Later, we will fit several of the schemes into our common framework. In order to do this it is helpful to think of messages sent between processors as being tagged with partially-ordered logical times; often these tags are implicit. Many systems can inform a processor when it will not see any more messages with a particular logical time $t$. We call this a notification at time $t$. An event at time $t$ means the delivery of either a message or a notification with that time. In the following we divide logical times into two broad categories: sequence numbers, and structured times, which include epochs.

2.1 Sequence numbers 

Figure 2: Logical times for events. Tuples (•) on input edges represent messages that have not yet been processed: the tuple shows the logical time of the message. Tuples on output edges represent sent messages. In Scheme (a), the logical time of a message with sequence number $s$ on edge $e$ is $(e, s)$. $p$ has processed the first 4 messages on edge $e_1$ and the first 7 on $e_2$, and has sent 3 messages on $e_3$ and 4 on $e_4$. Scheme (b) uses epoch numbers as logical times, so all messages in a given epoch have the same time. $q$ has processed all the events in the first two epochs, and has sent all corresponding messages for those epochs. Scheme (c) uses structured logical times, generalizing epochs. $r$ forwards incoming messages into a loop, which has a different time domain that includes an additional loop iteration counter. $r$ has processed all events in the first epoch and sent all the messages it will ever produce with epoch 1 and any iteration count. The frontier $f(x)$ and edge projection $\phi(e)(f(x))$ at processor $x$ are discussed in Section 3.

Sequence numbers on ordered channels are illustrated in Figure 2(a). There is no need for notifications when using sequence numbers, since each message has a unique time. Rollback schemes that we model using sequence numbers are often used for systems where computation is not naturally structured using epochs. Such schemes include the following:

Distributed Snapshots. Chandy and Lamport described a general algorithm for checkpointing an arbitrary distributed system [7]. Each process $p$ receives messages from other processes in the system on a set of point-to-point channels $E(p)$. Periodically the system performs a global checkpoint: it chooses, for each process $p$ and channel $e \in E(p)$, a sequence number $s_e$, and records the state $C_p$ of $p$ after all the messages up to $s_e$ have been delivered on $e$ and no others. The checkpoint also includes a sequence of undelivered messages $M_e$ on each channel $e$. The design of the algorithm ensures that the chosen $\{C_p\}, \{M_e\}$ form a consistent global system state. Following a failure the system is restored to the state at the most recently saved checkpoint. This scheme is general, but has some practical drawbacks. Each process must be able to save a checkpoint at an arbitrary moment chosen by the system, which introduces overhead that is side-stepped by some designs below. Also, in general all processes, even non-failed ones, must roll back to a prior checkpoint following a failure.

Exactly-once streaming. Streaming systems including Storm and Millwheel [6] support stateful processors to which a message is guaranteed to be delivered exactly once, corresponding to the “eager checkpoint” regime of Figure 1. On receiving a message a processor persists its updated state and any resulting outgoing messages before acknowledging the processed message. As with the Chandy-Lamport algorithm, the persisted state encodes the effect of processing all messages up to the latest sequence number on each input, and no others. If a processor fails it is restored to its most-recently persisted state, which includes the effect of all acknowledged messages. This scheme has several benefits: it allows processors to choose locally when to checkpoint; it can guarantee high availability; non-failed processors need never be interrupted; and processors may join and leave the computation with low overhead since the system need not keep track of the dataflow topology. Drawbacks include a possible throughput penalty because all mutations to state must be persisted, and a possible latency penalty because sent messages must be acknowledged by their recipient process before the next incoming message can be acknowledged. The chain of dependent acknowledgements that builds up as a message’s effects propagate may also limit the practical complexity of computations; for example iterative algorithms may be problematic.
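The persist-before-acknowledge discipline described above can be illustrated with a minimal Python sketch. It is not taken from Storm or Millwheel: the class name, the `stable` dictionary standing in for stable storage, the running-sum state, and the use of a single input sequence number to recognize redelivered messages are all assumptions made for illustration.

```python
class EagerCheckpointProcessor:
    """Toy stateful processor that persists before it acknowledges."""

    def __init__(self, stable):
        # `stable` is a plain dict standing in for stable storage (assumption).
        self.stable = stable
        self.total, self.last_seq = stable.get("ckpt", (0, 0))

    def deliver(self, seq, value):
        if seq <= self.last_seq:
            # Redelivery of a message whose effect is already persisted.
            return "ack"
        total = self.total + value
        # Persist the outgoing message and the new state before acknowledging.
        # A real system would have to make these two writes atomic.
        self.stable.setdefault("outbox", []).append(("sum", total))
        self.stable["ckpt"] = (total, seq)
        self.total, self.last_seq = total, seq
        return "ack"


stable = {}
p = EagerCheckpointProcessor(stable)
p.deliver(1, 5)
p.deliver(2, 7)

# Simulated failure: a fresh instance restores the latest persisted state.
recovered = EagerCheckpointProcessor(stable)
recovered.deliver(2, 7)   # the retried message is recognized, not applied twice
```

Every state mutation costs a write to stable storage here, which is the throughput penalty the text mentions.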

At-least-once streaming. Both Storm and Millwheel also allow processors to be placed in a relaxed fault tolerance mode, in which the system does not eagerly checkpoint each state update before proceeding to the next. This gives better performance, but must only be used for processors for which message deliveries are idempotent, or where it is tolerable to end up in a globally-inconsistent state. It is suitable for the “ephemeral” regime in our example.

* Millwheel addresses some latency concerns by partitioning the state at each processor by a key function and performing work for distinct keys in parallel. It can also notify a processor when a low-watermark has passed, based on wall-clock timestamps. These notifications are not the same as the logical-time notifications in this paper, and we can model them as messages delivered on a virtual edge.

2.2 Epochs 

Some systems associate each input message with a particular batch or epoch, and structure computation (often using dataflow) so that all consequent messages and state updates can in turn be tagged with an epoch. These epochs can be used as coarse-grain logical times for events as illustrated in Figure 2(b).

A number of recent acyclic batch dataflow systems share a fault tolerance model pioneered by the MapReduce system. Each processor reads all of its inputs then implicitly receives a notification that the input is complete, writes its outputs, empties its state, and quiesces. We can think of all inputs and messages as being in a single epoch 0. Each system design specifies a subset of the edges in the dataflow and persists the messages sent on those edges. Following a failure the system chooses to restore each failed processor either to the state where it has processed no events, or where it has processed all events, based on a global function of which sent messages have been persisted.

This design has the appealing property that processors are always restored to an empty state after failure: this means that the (user-supplied) application logic in the processor need not include any checkpointing code. On the other hand, any work in progress at the time of a failure is lost and must be redone. Non-failed processors need only be interrupted if they have consumed messages from processors that were restored to the empty state. The model is well suited to off-line data-parallel workloads, where throughput in the absence of failures is paramount and delayed job completion is tolerable in the event of failures. A variation on the model, Spark Streaming, allows each processor to accept messages at an epoch $t + 1$ after the messages at epoch $t$ have been fully processed, matching the “batch” regime of our example. Unlike traditional streaming systems it does not let processors retain internal state between logical times.

2.3 Selective rollback 

The Naiad system [12] achieves state of the art performance on the streaming iterative workload needed for the “lazy checkpointing” regime of our example, so we consider its fault-tolerance requirements. Naiad explicitly assigns logical times to events. Each time is a tuple indicating an input epoch along with loop counters tracking progress through (possibly-nested) iteration as in Figure 2(c). A processor can request that a notification be delivered when a logical time is complete. Figure 3 shows a fragment of a simple Naiad dataflow graph made up of Select, Sum and Buffer processors. Below it is a timeline showing event deliveries and corresponding updates to the processor state, colored according to logical times. The Select processor translates a word into its numeric representation, and is stateless. The Sum processor accumulates a separate sum for each logical time. When notified that there will be no more messages at a given time, Sum outputs the accumulated sum for that time and then removes the sum from its local state. The Buffer processor records all messages it has seen.

Figure 3: Selective rollback. Rectangles show messages and ovals processor state. A white background indicates a message or state corresponding to logical time A; a grey background to time B. The dashed line shows the point at which a processor will not receive any more messages at time A; a notification is delivered to the Sum processor after this point, causing it to send a message and discard its state related to A. Processors roll back to a state where they have consumed all messages at A and none at B.

All the Naiad computational libraries developed so far, including differential dataflow, which is the most complex, either keep no state at a processor or partition its state by logical time. Many Naiad processors, like the Sum in our example, delete the state corresponding to a time once that time is complete. It is thus desirable to allow a processor to wait until time $t$ is complete before checkpointing the portion of local state that corresponds to $t$. Often this means no checkpoint need be saved, matching the software-engineering and performance characteristics of the systems in Section 2.2.


Naiad applications often include loops implemented as distributed sets of processors, and messages can flow around these loops with latencies of a millisecond or less. Restricting Naiad to suspend delivery of a message until all messages with earlier times had been processed would force a processor to stall waiting for the global coordinator to ensure that no “earlier” messages remained in the system, introducing a severe performance penalty. Consequently, Naiad processors may interleave the delivery of messages with different logical times.

We introduce the idea of selective rollback in order to support Naiad’s twin performance requirements that processors must be able to interleave the logical times of delivered messages, and also checkpoint only state corresponding to completed times. In Figure 3 each processor makes a selective checkpoint after seeing the last time A message. Rather than saving its full current state, as is traditional, it saves the state it would contain having seen all time A messages and no time B messages. In general this checkpoint may not correspond to a state the processor has previously been in. The shaded rectangle shows a rollback during which each processor is set to its checkpointed state. Subsequently an upstream processor is re-executed, causing the time B message to be re-sent, and eventually the state of the system returns to that before the rollback. A scheme that did not support selective rollback would be forced to prevent the interleaved delivery of messages at different times, or to checkpoint non-empty state for the Sum processor, either of which would introduce a substantial performance penalty for Naiad.
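To make the selective checkpoint of the Sum processor concrete, the following Python sketch keeps one running sum per logical time and checkpoints only the state for times inside a chosen set. It is a toy illustration under our own assumptions, not Naiad code: the class and method names are hypothetical, and logical times are represented by the strings "A" and "B".

```python
from collections import defaultdict


class SumProcessor:
    """Toy Sum processor keeping one running sum per logical time."""

    def __init__(self):
        self.sums = defaultdict(int)   # logical time -> partial sum
        self.outputs = []              # (time, total) messages sent downstream

    def on_message(self, time, value):
        self.sums[time] += value

    def on_notify(self, time):
        # No more messages at `time`: emit the total and drop that state.
        self.outputs.append((time, self.sums.pop(time, 0)))

    def selective_checkpoint(self, frontier):
        # Keep only the state for times inside the frontier, regardless of
        # the order in which messages were actually delivered.
        return {t: s for t, s in self.sums.items() if t in frontier}

    def restore(self, checkpoint):
        self.sums = defaultdict(int, checkpoint)


p = SumProcessor()
p.on_message("A", 1)
p.on_message("B", 10)   # interleaved delivery of a time-B message
p.on_message("A", 2)
p.on_notify("A")        # time A is complete
ckpt = p.selective_checkpoint({"A"})
p.restore(ckpt)         # rollback: all of A retained, none of B
```

Because the state for time A was discarded on notification, the checkpoint is empty even though a time-B message had already been delivered, which is the case in which no checkpoint data need be saved.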


3 The Falkirk Wheel framework 

We now describe our general framework for rollback using logical times. As previously mentioned, after a failure the system chooses a set of logical times at each processor, which we call a frontier, and restores the processor to a state including the effect of the previously-delivered events with times in that frontier. We first discuss some restrictions on the use of logical times in our framework, and show that the existing schemes described in Sections 2.1 and 2.2 satisfy these restrictions. We then discuss a general algorithm for choosing frontiers that will result in rolling back to a globally consistent state.


3.1 From sets to frontiers

Not all sets of logical times can be used as frontiers; a frontier must be downward-closed. This means that if a time $t$ is in the frontier, then so is every time $t' \le t$. For a set $T$ of times we write $\downarrow T = \{t' : t \in T \wedge t' \le t\}$ for the operation that converts a set into the smallest frontier containing that set. The schemes described in Sections 2.1 and 2.2 already naturally adopt frontiers for rollback. For epochs logical times are totally ordered, so the restriction simply means that if we are rolling back to epoch $t$ we must also include all previous epochs $t' < t$. For sequence numbers, recall that a logical time is a pair $(e, s)$ where $e$ is an edge and $s$ is the sequence number of a message on that edge. We define a partial order on these times where $(e_1, s_1) \le (e_2, s_2)$ if and only if $e_1 = e_2 \wedge s_1 \le s_2$. This means that times are only comparable if they correspond to messages on the same edge, and within an edge sequence numbers indicate the natural ordering. For a processor with incoming edges $e_1, \ldots, e_n$ we associate the state in which the processor has consumed all messages up to $s_i$ on edge $e_i$ with the set

$$f_{e_1,\ldots,e_n}(s_1,\ldots,s_n) = \downarrow\{(e_1,s_1),\ldots,(e_n,s_n)\}.$$

This set is a frontier under the partial order above, and corresponds to the messages whose effects are included in a checkpoint at that state. Figure 2(a) shows the frontier $f(p) = f_{e_1,e_2}(4, 7)$.


3.2 Bridging time domains

The edge projection functions $\phi(e)$ shown in Figure 2 allow us to reason about rollback in a system containing processors with different logical time domains. For each edge $e$ from processor $p$ to $q$, $\phi(e)(f)$ maps a frontier $f$ at $p$ to a frontier in the time domain of $q$. The function $\phi(e)$ must be consistent with the behavior of $p$: it is a conservative estimate of the times that were “fixed” on $e$ given the events in $f$ at $p$. Specifically, $p$ is guaranteed not to have produced any messages with times in $\phi(e)(f)$ as a result of processing an event with a time outside $f$. Informally, this means it is “safe” to roll $q$ back to $\phi(e)(f)$ as long as $p$ rolls back to a frontier at least as large as $f$. We could always set $\phi(e)(f) = \emptyset$, but instead would like to choose it as large as possible since larger $\phi$ will allow us to preserve more work during rollbacks.

In rollback schemes that use sequence numbers $\phi(e)(f)$ is defined naturally as illustrated in Figure 2(a). Suppose that when $p$ is in state $(s_1, \ldots, s_n)$ it has sent $s$ messages on outgoing edge $e$. Then

$$\phi(e)(f_{e_1,\ldots,e_n}(s_1,\ldots,s_n)) = \{(e, 1), \ldots, (e, s)\}.$$

(Conveniently, for our purposes we need not define $\phi(e)(f)$ for any frontier that does not correspond to a state in the history of $p$.) Systems that use epochs typically adopt the restriction that messages cannot be sent backwards in time. For these systems we can set $\phi(e)(f) = f$ everywhere, meaning an event at epoch $t$ cannot result in a message at any epoch $t' < t$.
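The definitions of frontiers and edge projections for sequence numbers can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, logical times are modelled as `(edge, sequence_number)` pairs with sequence numbers starting at 1, and the example values follow the description of processor $p$ in the caption of Figure 2(a).

```python
def down_closure_seq(times):
    """Smallest frontier containing `times`.

    Each time is a pair (edge, seq), ordered by
    (e1, s1) <= (e2, s2) iff e1 == e2 and s1 <= s2.
    """
    return {(e, k) for (e, s) in times for k in range(1, s + 1)}


def seq_frontier(consumed):
    """Frontier of a processor that has consumed messages 1..s_i on each
    input edge e_i; `consumed` maps edge name -> s_i."""
    return down_closure_seq(set(consumed.items()))


def seq_projection(edge, sent_count):
    """Edge projection for sequence numbers: the first `sent_count`
    messages sent on `edge` are fixed."""
    return {(edge, k) for k in range(1, sent_count + 1)}


def epoch_projection(frontier):
    """Edge projection for epoch systems that never send backwards in time."""
    return set(frontier)


# p has processed 4 messages on e1 and 7 on e2, and has sent 3 on e3.
f_p = seq_frontier({"e1": 4, "e2": 7})
phi_e3 = seq_projection("e3", 3)
```

Representing a frontier as an explicit set is only practical for small examples; an implementation would more plausibly store the maximal elements, here one sequence number per edge.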
Figure 4: Processor $p$ filters its history on rollback. $H(p)$ shows a sequence of events at $p$. Earlier events are to the left. There are three delivered messages, then a notification, then another message. $\Pi_e(H(p))$ shows the messages sent on edge $e$; so $p$ sent two messages with time 4 on $e_3$. The border around an event or sent message shows the time of the event at $p$; so the first message on $e_3$ was sent at time 3, and the second at 4. The state of $p$ after a rollback to the frontier $f = \{(1), (2), (3)\}$ is shown below the dotted line. The history and sent messages are filtered to retain only the events in $f$ at $p$. $\bar{M}(e_1,f)$, $\bar{M}(e_2,f)$ and $\bar{N}(p,f)$ are the minimum frontiers containing the processed messages and notifications, respectively, in $p$’s filtered history. The processor logged all sent messages on $e_4$ and none on $e_3$.


Figure 2(c) shows an example of a processor that receives messages tagged with epochs and forwards them in a new time domain: sent messages have times $(t, c)$ where $t$ is the epoch of the incoming message and $c$ a loop counter. In this case we can choose $\phi(e)(f)$ to be $\{(t, c) : t \in f\}$, so $\phi$ “translates” between time domains.

Even in systems without loops, it may be useful to translate between time domains. A processor $p$ may want to read from a computation structured using epochs and forward its input to a processor that takes eager checkpoints according to sequence numbers. In this case we might require $p$ to forward all epoch 1 data before sending any epoch 2 data, if necessary buffering epoch 2 data until epoch 1 is complete. Suppose that in total $p$ receives 73 messages in epoch 1; then we could choose $\phi(e)(\{1\}) = \{(e,1), \ldots, (e,73)\}$. A similar transformer could translate from sequence numbers to epochs, for example to construct epochs from sets of messages received at a processor within particular windows of wall-clock time.
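The two translations between time domains described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the projection into the loop domain is returned as a membership test because the set $\{(t, c) : t \in f\}$ is unbounded in the counter $c$, and the epoch-to-sequence-number projection assumes that the processor forwards every message it receives, that it forwards epochs strictly in order, and that the frontier passed in is downward-closed.

```python
def loop_projection(frontier):
    """Membership test for the projection {(t, c) : t in frontier}."""
    def contains(time):
        epoch, _counter = time
        return epoch in frontier
    return contains


def epoch_to_seq_projection(edge, counts, frontier):
    """Projection from epochs to sequence numbers on `edge`.

    `counts` maps epoch -> number of messages forwarded for that epoch.
    Assumes epochs are forwarded strictly in order, so the messages of the
    epochs in `frontier` occupy the first sum(counts) sequence numbers.
    """
    total = sum(counts[t] for t in frontier)
    return {(edge, k) for k in range(1, total + 1)}


in_loop = loop_projection({1})
fixed = epoch_to_seq_projection("e", {1: 73, 2: 10}, {1})
```

With 73 messages in epoch 1, the projection of the frontier containing only epoch 1 is the first 73 sequence numbers on the edge, matching the example in the text.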


3.3 Message re-ordering 


We must impose a restriction on the semantics of processors that will be subject to selective rollback. This does not affect the schemes described in Sections 2.1 or 2.2, which never perform selective rollback. We require that such a processor $p$ must be able to perform a limited re-ordering of messages on its input edges. Suppose $e$ is an input edge to $p$, and contains a sequence of messages $\langle m_1, \ldots, m_k \rangle$ where $m_1$ is at the head of the sequence, i.e. $m_1$ was sent before $m_2$, and so on. Then $p$ is at liberty to choose to remove and process from $e$ any message $m_i$ where $time(m_j) \not\le time(m_i)\ \forall j < i$. So if $m_5$ is in epoch 1 and all of $m_1, \ldots, m_4$ are in epochs 2 or greater, $p$ can choose to process $m_5$ next. It does not have to be the case that $p$ produces the same output under all re-orderings, but all of the outputs have to correspond to legal behaviors of the computation. This restriction is intuitively necessary if we want to legally be able to roll $p$ back to a state in which it has processed all the epoch 1 events and none from later epochs, independent of the order that the messages appeared on $e$. It is satisfied by all Naiad processors we are aware of.
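The re-ordering rule can be checked mechanically. The sketch below reads the rule as: a queued message may be processed next if no message ahead of it in the queue has a time less than or equal to its own. The function name and the list representation of the queue are hypothetical, and indices are 0-based, so index 4 corresponds to $m_5$.

```python
def eligible_indices(times, leq):
    """Indices of queued messages that may be processed next.

    `times` lists the logical times of the messages on an edge, head first.
    `leq(a, b)` is the partial order on times. Message i is eligible if no
    earlier message j < i satisfies leq(times[j], times[i]).
    """
    return [
        i for i, t in enumerate(times)
        if not any(leq(times[j], t) for j in range(i))
    ]


# Five queued messages tagged with epochs; the last one is in epoch 1.
epochs = [2, 3, 2, 4, 1]
choices = eligible_indices(epochs, lambda a, b: a <= b)
```

Here only the head of the queue and the epoch 1 message are eligible; the second epoch 2 message is not, because messages with equal times keep their relative order.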

3.4 Checkpoints and processor history 

When deciding what frontiers a processor $p$ can be rolled back to we need to take into account exactly what information $p$ has persisted. For example, processors can in general only roll back to a fixed set of frontiers for which they took checkpoints. Also, some processors log sent messages and others do not.

We start with notation. $H(p)$ is the history at $p$ at the time of the rollback, i.e. the sequence of events that it has processed, and $H(p)@f$ is the subsequence of $H(p)$ keeping only events with times in a frontier $f$. (For processors that don’t perform selective rollback, $H(p)@f$ is always a prefix of $H(p)$.) For $e \in \mathrm{Out}_e(p)$, the output edges at $p$, $\Pi_e(H(p))$ is the sequence of messages that $p$ sent on $e$ as a result of processing the events in $H(p)$, and $\Pi_e(H(p)@f)$ is the sequence of messages that $p$ would have sent on $e$ if it had processed only the events in $H(p)@f$. When $H(p)@f$ is not a prefix of $H(p)$, $\Pi_e(H(p)@f)$ may not be a subsequence of $\Pi_e(H(p))$, though it is for all the processors we have studied. Figure 4 shows an example history.
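The filtering operation $H(p)@f$ is simple to state in code. In the Python sketch below, events are `(kind, time)` pairs and a frontier is a set of times; the function names and the example history are hypothetical and are not the history drawn in Figure 4.

```python
def filter_history(history, frontier):
    """Return the subsequence of `history` whose event times lie in `frontier`."""
    return [event for event in history if event[1] in frontier]


def is_prefix(candidate, sequence):
    """True if `candidate` is a prefix of `sequence`."""
    return sequence[:len(candidate)] == candidate


# Hypothetical history: (kind, logical time), earliest event first.
history = [("msg", 1), ("msg", 3), ("msg", 2), ("notify", 1), ("msg", 4)]

kept = filter_history(history, {1, 2, 3})   # drops only the final event
selective = filter_history(history, {1, 2})  # also drops an interleaved event
```

The first filtered history is a prefix of the original, as it would be without selective rollback. The second is not, because the time 3 message was delivered between events at times 1 and 2.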

In general we don’t have access to $H(p)$, $\Pi_e(H(p))$, $H(p)@f$, or $\Pi_e(H(p)@f)$. Instead, we assume that there is some sequence of frontiers $F^*(p) = \{f_1, \ldots, f_n\}$, where $f_i \subset f_{i+1}$, that are available for $p$ to roll back to because it has persisted appropriate information about them, summarized in Table 1. For a processor that has not failed, $F^*(p)$ may contain the special frontier $\top$ that includes all event times.

Table 1: State that must be available to processor $p$ on rollback. Most processors can approximate some values and do not need to explicitly persist all of them.

  $F^*(p)$              Set of available frontiers

  For each $f \in F^*(p)$:
  $S(p,f)$              Internal state at $f$
  $\bar{N}(p,f)$        Processed notification frontier at $f$

  For each $f \in F^*(p)$, $d \in \mathrm{In}_e(p)$:
  $\bar{M}(d,f)$        Processed message frontier from $d$ at $f$

  For each $f \in F^*(p)$, $e \in \mathrm{Out}_e(p)$:
  $\phi(e)(f)$          Edge projection on $e$ at $f$
  $L(e,f)$              Messages logged on $e$ at $f$
  $\bar{D}(e,f)$        Discarded message frontier on $e$ at $f$

For each $f \in F^*(p)$ we assume $p$ has persisted enough information to recover $\phi(e)(f)$ for each $e \in \mathrm{Out}_e(p)$ and to be able to restore its internal state to $S(p,f)$, which reflects the effects of all events in $H(p)@f$. Depending on $p$’s policy it may have logged some, all, or none of its sent messages. We write $L(e,f)$ for the subsequence of $\Pi_e(H(p)@f)$ that have been logged and $D(e,f) = \Pi_e(H(p)@f) \setminus L(e,f)$ for those that were discarded. Let $M(d,f)$ be the sequence of messages on $d \in \mathrm{In}_e(p)$, the input edges to $p$, that were processed by $p$ in $H(p)@f$, and $N(p,f)$ the sequence of notifications processed by $p$ in $H(p)@f$. For each $f \in F^*(p)$ we assume that $p$ has stored a conservative estimate of $\bar{M}(d,f)$, $\bar{N}(p,f)$, and $\bar{D}(e,f)$, respectively the smallest frontier containing its delivered messages, notifications, and discarded messages:

$$\bar{M}(d,f) = \downarrow\{t : (d,m) \in M(d,f) \wedge t = time(m)\}$$

$$\bar{N}(p,f) = \downarrow\{t : t \in N(p,f)\}$$

$$\bar{D}(e,f) = \downarrow\{t : m \in D(e,f) \wedge t = time(m)\}.$$

Note that $time(m)$ for $m \in D(e,f)$, and thus also $\bar{D}(e,f)$, is in the domain of the process that will receive the message, not $p$’s time domain.

In many cases, $p$ need not explicitly store all the state in Table 1. For most schemes that use structured times, including epochs, $\phi(e)(f)$ is independent of $p$’s history. It is always safe to overestimate $\bar{M}(d,f) = \bar{N}(p,f) = f$. If the processor logs all messages, $\bar{D}(e,f) = \emptyset$. For most processors that discard all messages it is safe to use the approximation $\bar{D}(e,f) = \phi(e)(f)$, though processors that send “into the future,” like some differential dataflow processors, must explicitly keep track of which times they have discarded messages for. Finally, in the common case (as in Section 2.2) that $p$ is an epoch-based processor that keeps no state between epochs, sends all messages with the epoch of the event that caused the message, and doesn’t log any messages, it need not persist anything. Such processors can adopt

$$S(p,f) = \emptyset, \quad L(e,f) = \langle\rangle$$

$$\phi(e)(f) = \bar{M}(d,f) = \bar{N}(p,f) = \bar{D}(e,f) = f$$

and need not even save $F^*(p)$ since they can restore to any requested frontier.


3.5 Consistent frontiers for rollback 

In the event of one or more failures, the system must choose a frontier $f(p)$ at each processor $p$ such that the system as a whole rolls back to a consistent global state. We list a set of constraints that, if satisfied, ensure a consistent rollback. We have published a theoretical paper that proves the correctness of the constraints. We show via a refinement mapping that a system which obeys the Falkirk Wheel rollback constraints on failure implements (has external effects indistinguishable from) a higher-level system without failures.

The hrst constraint says that a processor p may not 
restore to a frontier / if there is any message m awaiting 
delivery on an edge e G Ing{p) with time{m) G f. This re¬ 
striction can be satished by saving a checkpoint for fron¬ 
tier / only after all the times in / are complete at p. This 
behavior is already adopted by the systems described in 
Sections |2.1| and |2.2| and is easy to enforce for systems 
such as Naiad that support notihcation. 

The next constraint deals with discarded messages:

$$\forall e \in \mathrm{Out}_e(p),\ \bar{D}(e, f(p)) \subseteq f(\mathrm{dst}(e))$$

where $\mathrm{dst}(e)$ is the processor that p sends to on e.
Informally, this says that a processor downstream of p cannot
roll back so far that it would need to re-receive any
messages that p has discarded.

The third constraint deals with delivered messages:

$$\forall d \in \mathrm{In}_e(p),\ \bar{M}(d, f(p)) \subseteq \varphi(d)(f(\mathrm{src}(d)))$$

where $\mathrm{src}(d)$ is the processor that sends to p on d. This
says that a processor must roll back far enough that any
delivered messages are within the frontier “fixed” by the
upstream processor's rollback, in the sense described
earlier.

Figure 5: Without notification frontiers rollback can
lead to inconsistent state. See Section 3.5 for details.

The final constraints deal with notifications and
are motivated by the example in Figure 5, in which
$\varphi(e)(f) = f$ for all e. Processors p and q have each
received a notification for time 1, in response to which p
sent a message at time 1 on $e_1$ and q did nothing. The
message arrived at r, which sent nothing in response, at
which point x received a notification for time 1 indicating
that it will not receive any more time 1 messages.
According to the preceding constraints the system could
roll back to the frontiers shown in the figure; in particular
$f(q)$ can be set to $\emptyset$ since $\bar{M}(e_2, \{(1)\}) = \emptyset$.
Suppose, after rollback, q behaves differently on receiving the
notification and sends a message at time 1 on $e_2$, which
r forwards on $e_3$. Then x will receive a new message
at time 1 even though it has rolled back to a history in
which it received a notification promising this will never
happen. The problem cannot be fixed by simply adding a
new pairwise constraint between r and x: in the example
they already roll back to the same frontier. Instead we
introduce an auxiliary variable at each p, the notification
frontier $f_n(p)$, and add additional constraints:

$$f_n(p) \subseteq f(p)$$

$$\bar{N}(p, f(p)) \subseteq f_n(p)$$

$$\forall d \in \mathrm{In}_e(p),\ f_n(p) \subseteq \varphi(d)(f_n(\mathrm{src}(d))).$$

The notification frontiers are not used in the rollback;
they simply act to constrain $f(p)$ to ensure consistency.
Notification frontiers can be “omitted” by setting
$\bar{N}(p, f) = f_n(p) = \emptyset$ everywhere in systems without
notifications.
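
The Figure 5 scenario can be checked mechanically. The Python sketch below is illustrative only. It assumes the topology described in the text (p sends to r on e1, q sends to r on e2, r sends to x on e3), takes q as the processor rolled back to the empty frontier, models frontiers as finite sets of times with the identity for every edge function, and additionally assumes that every processor logs its output so that all discard summaries are empty, which the text does not state. All names are hypothetical.

```python
EMPTY = frozenset()
ONE = frozenset({1})

# edge -> (source, destination); the edge function is the identity on every edge
edges = {"e1": ("p", "r"), "e2": ("q", "r"), "e3": ("r", "x")}

# Candidate rollback frontiers f(.)
f = {"p": ONE, "q": EMPTY, "r": ONE, "x": ONE}

# Summaries evaluated at those frontiers
n_bar = {"p": ONE, "q": EMPTY, "r": EMPTY, "x": ONE}  # notifications processed
m_bar = {"e1": ONE, "e2": EMPTY, "e3": EMPTY}         # messages delivered on each edge
d_bar = {e: EMPTY for e in edges}                     # assumption: everything is logged


def message_constraints_hold() -> bool:
    """Discarded-message and delivered-message constraints on every edge."""
    return all(
        d_bar[e] <= f[dst] and m_bar[e] <= f[src]
        for e, (src, dst) in edges.items()
    )


def largest_notification_frontiers() -> dict:
    """Greatest f_n with f_n(p) inside f(p) and inside f_n(src(d)) for every input d."""
    fn = dict(f)
    changed = True
    while changed:
        changed = False
        for src, dst in edges.values():
            narrowed = fn[dst] & fn[src]
            if narrowed != fn[dst]:
                fn[dst] = narrowed
                changed = True
    return fn


def notification_constraints_hold() -> bool:
    """If the greatest f_n misses some notification summary, no valid f_n exists."""
    fn = largest_notification_frontiers()
    return all(n_bar[proc] <= fn[proc] for proc in f)
```

In this sketch the message constraints accept the candidate frontiers, while the notification constraints reject them: the largest admissible notification frontier at x is empty, yet x has processed a notification for time 1.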


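Initially: $\forall p,\ f(p) = f_n(p) = \max\{f \in F^*(p)\}$.

Continue until fixed point:

$$\begin{aligned}
f'(p) = \max\{\, & g \in F^*(p) \text{ such that } g \subseteq f(p) \\
 & \wedge\ \forall e \in \mathrm{Out}_e(p),\ \bar{D}(e, g) \subseteq f(\mathrm{dst}(e)) \\
 & \wedge\ \forall d \in \mathrm{In}_e(p),\ \bar{M}(d, g) \subseteq \varphi(d)(f(\mathrm{src}(d))) \\
 & \qquad \wedge\ \bar{N}(p, g) \subseteq \varphi(d)(f_n(\mathrm{src}(d))) \,\} \\
f'_n(p) = \max\{\, & g_n \text{ such that } g_n \subseteq f'(p) \cap f_n(p) \\
 & \wedge\ \bar{N}(p, f'(p)) \subseteq g_n \\
 & \wedge\ \forall d \in \mathrm{In}_e(p),\ g_n \subseteq \varphi(d)(f_n(\mathrm{src}(d))) \,\}
\end{aligned}$$

Figure 6: Algorithm to choose consistent frontiers for
rollback.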
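
The fixed point of Figure 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the Naiad implementation: frontiers are finite sets of integer epochs, the candidates in each $F^*(p)$ are totally ordered by inclusion and include the empty frontier, and every summary evaluated at the empty frontier is empty. The three-processor pipeline and all names are hypothetical.

```python
from typing import FrozenSet

Frontier = FrozenSet[int]
EMPTY: Frontier = frozenset()


def down(k: int) -> Frontier:
    """Down-closed frontier containing epochs 1..k."""
    return frozenset(range(1, k + 1))


def largest(frontiers):
    # Assumes the candidates are totally ordered by inclusion.
    return max(frontiers, key=len)


def choose_frontiers(procs, edges, phi, F, M_bar, N_bar, D_bar):
    """Iterate the Figure 6 update until neither f nor f_n changes."""
    f = {p: largest(F[p]) for p in procs}
    fn = dict(f)
    while True:
        new_f, new_fn = {}, {}
        for p in procs:
            ins = [e for e, (_, dst) in edges.items() if dst == p]
            outs = [e for e, (src, _) in edges.items() if src == p]
            candidates = [
                g for g in F[p]
                if g <= f[p]
                and all(D_bar(e, g) <= f[edges[e][1]] for e in outs)
                and all(M_bar(d, g) <= phi[d](f[edges[d][0]]) for d in ins)
                and all(N_bar(p, g) <= phi[d](fn[edges[d][0]]) for d in ins)
            ]
            new_f[p] = largest(candidates)  # non-empty as long as EMPTY is in F[p]
            # With set-valued frontiers the largest g_n is the intersection of its upper bounds.
            gn = new_f[p] & fn[p]
            for d in ins:
                gn &= phi[d](fn[edges[d][0]])
            new_fn[p] = gn
        if new_f == f and new_fn == fn:
            return f, fn
        f, fn = new_f, new_fn


# Hypothetical pipeline a -> b -> c with identity edge functions.
procs = ["a", "b", "c"]
edges = {"e1": ("a", "b"), "e2": ("b", "c")}
phi = {e: (lambda fr: fr) for e in edges}
F = {
    "a": [EMPTY, down(3)],                    # a logs every output
    "b": [EMPTY, down(1), down(2), down(3)],  # b discards every output
    "c": [EMPTY, down(1)],                    # c failed; last checkpoint is epoch 1
}
M_bar = lambda d, g: g                          # safe overestimate
N_bar = lambda p, g: g                          # safe overestimate
D_bar = lambda e, g: EMPTY if e == "e1" else g  # e1 is logged, e2 is discarded

f, fn = choose_frontiers(procs, edges, phi, F, M_bar, N_bar, D_bar)
```

In this example c can restore only its epoch 1 checkpoint, b must roll back to epoch 1 because it discarded the later messages that c now needs, and a does not roll back because its logged output can be re-sent.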


3.6 Choosing consistent frontiers

Figure 6 shows an algorithm to find a frontier at each
processor that will satisfy the constraints. As long as
$\emptyset \in F^*(p)\ \forall p$, meaning every processor can roll back to
its initial state, it is always possible to choose values for
$f'$ and $f'_n$ while executing the fixed point. In this case
the algorithm will always converge since neither $f$ nor
$f_n$ ever increases, and $f(p) = f_n(p) = \emptyset\ \forall p$ satisfies all
constraints.

The choice of $f'_n(p)$ indicates a maximum over a subset
of all frontiers. If frontiers are not totally ordered, any
maximal element can be chosen. In all practical systems
we have considered either frontiers are totally ordered,
notifications are not supported, or $\bar{N}(p, f(p)) = f(p)$
everywhere (so $f_n(p) = f(p)$). In such systems the algorithm
will at every p return the maximal globally-consistent
frontier (and the term $\bar{N}(p, f'(p)) \subseteq g_n$ is unnecessary
since it will always be satisfied). For these systems,
adding choices of f to $F^*(p)$ at any p will never cause
$f(p')$ to get smaller for any $p'$: a valid set of frontiers
remains valid as more checkpoints are saved.

After frontier $f(p)$ is chosen for rollback at p, its state
is reset as follows:

$$F^{*\prime}(p) = \{f' : f' \in F^*(p) \wedge f' \subseteq f(p)\}$$

$$H'(p) = H(p)@f(p)$$

$$S'(p) = S(p, f(p))$$

$$Q'(e) = L(p, f(p))@\overline{f(\mathrm{dst}(e))} \quad \forall e \in \mathrm{Out}_e(p)$$

where $Q'(e)$ is a sequence of messages to send on e
and $L(p, f(p))@\overline{f(\mathrm{dst}(e))}$ is the messages in $L(p, f(p))$
whose times are not contained in $f(\mathrm{dst}(e))$. Figure 7
shows some examples of dataflow graphs with different
characteristics, and the frontiers that are chosen for
rollback.
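
Once a frontier has been chosen for a processor, the reset of its checkpoints and outgoing queues can be sketched as below. This is an illustrative sketch only: frontiers are modeled as finite sets of integer times, a logged message is a hypothetical (time, payload) pair with the time in the receiver's domain, and the function name is invented for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Frontier = FrozenSet[int]
Message = Tuple[int, str]  # (time in the receiver's domain, payload)


def reset_processor(
    checkpoints: List[Frontier],
    logged: Dict[str, List[Message]],
    f_p: Frontier,
    f_dst: Dict[str, Frontier],
) -> Tuple[List[Frontier], Dict[str, List[Message]]]:
    """Keep checkpoints within f(p); queue logged messages the receiver no longer holds."""
    kept = [g for g in checkpoints if g <= f_p]
    resend = {
        e: [(t, m) for t, m in msgs if t not in f_dst[e]]
        for e, msgs in logged.items()
    }
    return kept, resend


# Hypothetical processor that rolls back to {1, 2} while its receiver rolls back to {1}.
kept, resend = reset_processor(
    checkpoints=[frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})],
    logged={"e1": [(1, "a"), (2, "b")]},
    f_p=frozenset({1, 2}),
    f_dst={"e1": frozenset({1})},
)
```

Here the checkpoint for {1, 2, 3} is dropped, and only the message at time 2 is queued for re-sending because the receiver has retained the effect of time 1.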


4 Fault tolerance in Naiad 

In order to evaluate its performance and ease of use, we
have added prototype support for Falkirk Wheel fault
tolerance to Naiad. Naiad is structured as a low-level
system layer, a set of commonly-used framework
libraries, and a few application-specific processors. The
Lindi framework is a library of processors that keep no
state between logical times, with similar functionality to
Spark plus native support for iteration. Differential
Dataflow is a general-purpose library for incremental
iterative computation, in which processors generally
keep state to allow them to respond quickly to updates.
As we explain in the following, we have added appropriate
checkpointing and logging to all the Lindi and Differential
Dataflow processors, as well as hooks to make
it easy to add fault tolerance to custom processors.

[Figure 7 panels: (a) Sequence numbers; (b) Epochs; (c) Structured times in a loop]

Figure 7: Some examples of rollback. Panel (a) shows a system based on sequence numbers. Processor x has failed.
All processors log all outputs ($\bar{D}(\cdot, f) = \emptyset$) and there are no notifications. All processors roll back to a state where they
have sent at least as many messages as their upstream processors have consumed. Panel (b) shows a system based on
epochs, similar to Spark, where y has failed. Processor p acts like a Spark Resilient Distributed Dataset (RDD)
and has logged all its outputs; no other processors have saved any state. Both x and y must roll back to their initial
state, while p, q and r do not need to roll back. Panel (c) shows a system like Naiad with a loop, where y has failed.
Processor q logs its sent messages, but no other processors do. Processor p sends messages into the loop along $e_2$
and q sends them out of the loop along $e_3$. Processor y increments the loop counter coordinate of the time of each
message it receives on $e_5$ and then forwards it on $e_6$. As a result, q can roll back to ${\downarrow}\{(1,4)\}$ even though y rolls back
to ${\downarrow}\{(1,3)\}$. Thus q re-sends its logged messages at time (1,4) on $e_4$, “restarting” the processing in the loop.


4.1 Logging and checkpointing support 

For simplicity, we impose the lexicographic (total) ordering
on all Naiad logical times at a given processor for
checkpointing purposes. Since logical times at a processor
are totally ordered, a frontier can be summarized by a single
largest element, and frontiers are also totally ordered.
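
A minimal Python sketch of this representation follows. It assumes that all times at a given processor are integer tuples of the same length, such as (epoch, loop counter), and it relies on Python's built-in lexicographic comparison of tuples. The helper names, and the use of `None` for the empty frontier, are choices made for this example rather than Naiad APIs.

```python
from typing import Iterable, Optional, Tuple

Time = Tuple[int, ...]
Frontier = Optional[Time]  # largest contained time, or None for the empty frontier


def summarize(times: Iterable[Time]) -> Frontier:
    """Frontier containing `times`, represented by its single largest element."""
    return max(times, default=None)


def contains(frontier: Frontier, t: Time) -> bool:
    """True if time t lies inside the frontier."""
    return frontier is not None and t <= frontier


def frontier_leq(a: Frontier, b: Frontier) -> bool:
    """True if frontier a is contained in frontier b."""
    return a is None or (b is not None and a <= b)
```

Under the lexicographic order a frontier summarized by (1, 3) contains (0, 100), since the epoch coordinate is compared first.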

Naiad already requires that messages are serializable
in order to support distributed operation. Any processor,
with no additional programming effort, can request that
the system log all of its delivered messages and
notifications; i.e., its full history $H(p)$ in the notation of
Section 3. This gives any deterministic processor without
external side-effects full fault tolerance with no
software-engineering effort: it can be automatically rolled back
to any frontier by replaying the filtered history and
forwarding any resulting messages that are needed by
downstream processors after their rollback. This is a good
fallback option, but the history grows without bound so it is
not suitable for long-running streaming applications.

The system can automatically keep track of $\bar{N}$, $\bar{M}$ and
$\bar{D}$ for any processor. A processor can elect to log some
or all sent messages, again with no additional programming
effort. A processor can also declare that it keeps no
state between logical times, and we call such a processor
“stateless” even though it may accumulate state within
a time. Alternatively it can elect to receive checkpoint
callbacks. If such a processor requests a notification for
time t then it may selectively checkpoint its state up to
t after the notification has been processed. Stateful
processors are also periodically (lazily) informed when new
times become complete, and can choose to selectively
checkpoint based on local policy.

We identify all Lindi processors as stateless, and by
default suppress logging of sent messages, meaning that
the processors incur no fault tolerance overhead. A
particular instance of a processor may be told by an
application developer to log its sent messages, in which case
it behaves like a Spark RDD and acts like a “firewall”
preventing upstream processors from rolling back in the
event of a downstream failure.

We have added selective incremental checkpointing
to all Differential Dataflow processors that keep state.
Since the state is internally stored differentiated by
logical time, this was straightforward.

4.2 Garbage collection

A fault tolerance design that targets practical streaming
systems must address the issue of garbage-collecting
persisted state, since it will otherwise grow indefinitely. Let

$$Z(p, f) = [\bar{N}(p, f),\ \{\bar{M}(d, f) : d \in \mathrm{In}_e(p)\},\ \{\bar{D}(e, f) : e \in \mathrm{Out}_e(p)\}]$$

be the metadata about the checkpoint needed for the
rollback algorithm.

Each time a processor p receives an acknowledgement
from storage that $Z(p, f)$, $S(p, f)$ and $L(p, f)$ have all
been persisted for some f, it sends $Z(p, f)$ to a monitoring
service. This service keeps track of $F^*(p)$ for all
processors in the system. It starts with $F^*(p) = \emptyset$ and
updates it every time it receives new metadata. The monitor
runs an incremental implementation of the fixed point
algorithm of Figure 6 in a local Naiad runtime independent
of the main application. When an update arrives the
algorithm determines the new maximum rollback frontier
at every processor given the persisted checkpoints.
We assume that storage is reliable, so this rollback frontier
is a low-watermark: the processor will never need to
roll back beyond it in any failure scenario. Every time
the low-watermark frontier at p increases to f the monitoring
service informs p, which is at liberty to garbage-collect
$Z(p, f')$ and $S(p, f')$ for any $f' \subset f$. Processors q
that send to p are also notified, and can discard any messages
in $L(e, \cdot)$ with times in f for $e \in \mathrm{In}_e(p)$. Since the
monitoring service is deterministic, monotonic, and used
only for garbage collection, it could easily be replicated,
though our prototype does not do this.
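
The two garbage-collection rules can be sketched as follows. This is a hedged illustration, not the prototype's code: frontiers are modeled as finite sets of integer times, checkpoints as a dictionary keyed by frontier, and a log as a list of (time, payload) pairs. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Frontier = FrozenSet[int]


def collect_checkpoints(saved: Dict[Frontier, bytes], low: Frontier) -> Dict[Frontier, bytes]:
    """Drop checkpoints whose frontier is strictly inside the low-watermark."""
    return {g: state for g, state in saved.items() if not g < low}


def prune_log(log: List[Tuple[int, str]], receiver_low: Frontier) -> List[Tuple[int, str]]:
    """Drop logged messages whose times lie inside the receiver's low-watermark."""
    return [(t, m) for t, m in log if t not in receiver_low]


# Hypothetical state at a processor whose low-watermark has risen to {1, 2}.
saved = {
    frozenset(): b"s0",
    frozenset({1}): b"s1",
    frozenset({1, 2}): b"s2",
    frozenset({1, 2, 3}): b"s3",
}
low = frozenset({1, 2})
kept = collect_checkpoints(saved, low)
remaining = prune_log([(1, "a"), (2, "b"), (3, "c")], low)
```

The checkpoint at the low-watermark itself is retained, since it is the earliest state the processor might still need to restore.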

4.3 Inputs and outputs 

The fault-tolerance properties of a streaming system can
only be considered in the context of its streaming inputs
and outputs. We assume that the services producing and
consuming streams support fault tolerance via
acknowledgement and retry. For an input, this means that the
service will keep a batch of data available, and re-send
if requested, until the batch has been acknowledged. For
an output this means that we must be willing to re-send
a batch of data multiple times until it is acknowledged
by the recipient. These assumptions are compatible with
services such as Kafka and Azure Event Hubs.

Input and output acknowledgements can be handled by
our existing garbage-collection mechanism. Processors
that read external inputs are marked as stateless. Once
such a processor is informed by the monitor that it will
never need to roll back beyond a frontier f it can
acknowledge all inputs ingested at times in f. A processor
that sends external outputs is marked stateful but saves no
checkpoints; instead it tells the monitor that f has been
persisted once the external service has acknowledged all
records sent at times in f, at which point the rest of the
system may discard state that would be needed to
regenerate those output records. We can use this mechanism to
construct a stateless pipeline in which input records are
only acknowledged once outputs have been consumed;
or by adding persistent state in the pipeline we can
decouple input receipt from output acknowledgement.
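
The input side of this protocol can be sketched as below. It is a minimal sketch with hypothetical names: the class stands in for a processor that reads from an external service, epochs are integers, and recording a batch identifier in `acked` stands in for sending an acknowledgement to that service.

```python
from typing import Dict, FrozenSet, List


class InputBuffer:
    """Tracks external input batches until the monitor says they can be acknowledged."""

    def __init__(self) -> None:
        self.pending: Dict[int, str] = {}  # epoch -> batch id from the external service
        self.acked: List[str] = []         # batch ids acknowledged so far

    def ingest(self, epoch: int, batch_id: str) -> None:
        """Record that a batch was ingested at the given epoch."""
        self.pending[epoch] = batch_id

    def on_low_watermark(self, frontier: FrozenSet[int]) -> None:
        """Acknowledge every pending batch whose epoch lies inside the frontier."""
        for epoch in sorted(e for e in self.pending if e in frontier):
            self.acked.append(self.pending.pop(epoch))


buf = InputBuffer()
buf.ingest(1, "b1")
buf.ingest(2, "b2")
buf.ingest(3, "b3")
buf.on_low_watermark(frozenset({1, 2}))
```

Batches outside the low-watermark stay pending, so the external service keeps them available for a possible re-send after rollback.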

4.4 Recovery from failure 

A processor p typically discovers the failure of another
processor q by the failure of a network connection to
a remote computer. When this happens p continues to
work, buffering output to q in case the connection is
reestablished. When q's failure is confirmed by a failure
detector, the system pauses all processors and uses
the monitoring service to determine appropriate rollback
frontiers. All non-failed processors p have $\top$ temporarily
added to $F^*(p)$, and the incremental algorithm computes
the maximal frontiers needed for rollback given the
failed processors. A non-failed processor with a frontier
earlier than $\top$ can typically roll back by discarding
in-memory state rather than restoring from stable storage.
Any needed logged messages $Q'(e)$ are placed in appropriate
output queues, and the processors are restarted.
With some additional work Naiad could be modified
to allow pipelines of non-failed processors to continue
without pausing.

5 Conclusions 

We present a new framework for rollback recovery, suitable
for high-throughput streaming systems. We show
a general mechanism to determine a globally consistent
state given a collection of local checkpoints and logs
organized in terms of logical times, and information about
the local behavior of processors that constrains what logical
times may be assigned to messages sent in response
to events. The generality of the mechanism makes it possible
for processors to use flexible local policies to decide
when to take checkpoints, and as a result get substantial
performance and software engineering benefits.